Inferring the origin of an epidemic with a dynamic message-passing algorithm
We study the problem of estimating the origin of an epidemic outbreak: given a contact network
and a snapshot of epidemic spread at a certain time, determine the infection source. Finding 
the source is important in different contexts of computer or social networks. We assume that 
the epidemic spread follows the most commonly used susceptible-infected-recovered model. We 
introduce an inference algorithm based on dynamic message-passing equations, and we show that 
it leads to significant improvement of performance compared to existing approaches. Importantly, 
this algorithm remains efficient in the case where one knows the state of only a fraction of nodes. 
Introduction: Understanding and controlling the
spread of epidemics on networks of contacts is an
important task of today's science. It has far-reaching
applications in mitigating the results of epidemics caused by
infectious diseases, computer viruses, rumor spreading in 
social media and others. In the present article we address 
the problem of estimation of the origin of the epidemic 
outbreak (the so-called patient zero, or infection source;
in what follows, these terms are used interchangeably):
given a contact network and a snapshot of epidemic 
spread at a certain time, determine the infection source. 
Information about the origin could be extremely useful to 
reduce or prevent future outbreaks. Whereas the dynamics
and prediction of epidemic spreading in networks have
attracted a considerable number of works, for a review
see [H-Q i the problem of estimation of the epidemic origin
has been mathematically formulated only recently [J],
followed by a burst of research on this practically important
problem [H4H1 . In order to make the estimation of
the origin of spreading a well defined problem we need 
to have some knowledge about the spreading mechanism. 
We shall adopt here the same framework as in existing 
works, namely we assume that the epidemic spread fol- 
lows the widely used susceptible- infected-recovered (SIR) 
model or some of its special cases [12].

The stochastic nature of infection propagation makes 
the estimation of epidemic origin intrinsically hard: in- 
deed, different initial conditions can lead to the same 
configuration at the observation time. Finding an es- 
timator that maximizes the probability of the observed 
configuration is in general computationally intractable, 
except in very special cases such as the case where the 
contact network is a line or a regular tree [3, 0, [ll| . The 
methods that have been studied in the existing works 
are mostly based on various kinds of graph centrality 
measures. Examples include the distance centrality or 
the Jordan center of a graph 0-0]. The problem was
generalized to estimation of a set of epidemic origins us- 
ing spectral methods in [1, . Another line of approach 
uses more involved information about the epidemic than 
a snapshot at a given time [10].



In this paper we introduce a new algorithm for the es- 
timation of the origin of an SIR epidemic from the knowl- 
edge of the network and the snapshot of some nodes at 
a certain time. Our algorithm estimates the probability 
that the observed snapshot resulted from a given patient 
zero in a way which is crucially different from existing 
approaches. For every possible origin of the epidemic, we
use a fast dynamic message-passing method to estimate 
the probability that a given node in the network was 
in the observed state (S, I or R). We then use a mean- 
field-like approximation to compute the probability of the 
observed snapshot as a product of the marginal proba- 
bilities. We finally rank the possible origins according to 
that probability. 

The dynamic message-passing (DMP) algorithm that 
we use to estimate the probability of a given node to be 
in a given state is interesting in itself. It is based on 
dynamic equations that were first suggested in a different
(and not straightforwardly tractable) form in [13]. If
averaged over a graph ensemble it leads to the asymptotically
exact dynamic equations of [14], EH for SIR,
or to those of [16] for avalanches in the random field
Ising model. Note that DMP, although it bears some
similarity to the standard belief propagation (BP)
method [17], EH, is crucially different from BP since it
does not derive from a Boltzmann-like probability distribution.
It does not need to be iterated until convergence;
instead, the iteration time corresponds directly to the real
time in the associated SIR dynamics. A nice property 
that DMP shares with BP is that it is exact if the con- 
tact network is a tree. We use it as an approximation for 
loopy but sparse contact networks in the same way that 
BP is commonly used with success in such situations. 

We test our algorithm on synthetic spreading data and 
show that it performs better than existing approaches 
(except for a special region of parameters where the Jordan
center is on average better). We find that the algorithm
is very robust: for instance, it remains efficient even
in the case when the state of only a fraction of nodes in 
the network is observed. From our tests we also iden- 
tify regions of parameters where estimating the origin of 
epidemic spreading is relatively easy and others where
this problem is hard. Our dataset can hence also serve as
a test-bed for new approaches. 

SIR spreading model and inference of epidemic origin: 
Let G = (V, E) be a connected undirected graph contain- 
ing N nodes defined by the set of vertices V and the set 
of edges E. The SIR model for spreading of an epidemic 
is defined as follows: Each node $i$ at discrete time $t$ can
be in one of three states $q_i(t)$: susceptible, $q_i(t) = S$,
infected, $q_i(t) = I$, or recovered, $q_i(t) = R$. At each time
step, an infected node $i$ will recover with probability $\mu_i$,
and a susceptible node $i$ will become infected with probability
$1 - \prod_{k \in \partial i} (1 - \lambda_{ki} \delta_{q_k(t),I})$, where $\partial i$ is the set of the
nodes neighboring node $i$, and $\lambda_{ki}$ measures the efficiency
of spread from $k$ to $i$. The recovered nodes never change
their state. We assume that the graph $G$ and parameters
$\lambda_{ij}$, $\mu_i$ are known (or have been inferred). The general
properties and the phase diagram of this model on random
networks were studied in many works, see e.g. Q
and references therein.
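
The update rule above can be sketched as a short simulation. This is only an illustration under stated assumptions: the function name `simulate_sir`, the adjacency-dict representation `neighbors`, and the restriction to uniform probabilities `lam` and `mu` are choices made here, not part of the original text.

```python
import random

def simulate_sir(neighbors, lam, mu, source, t0, rng=None):
    """Run t0 synchronous SIR steps and return the final state of every node.

    neighbors: dict mapping each node to a list of its neighbors.
    lam, mu:   uniform transmission / recovery probabilities (an assumption
               made to keep the sketch short; the model allows lam_ki, mu_i).
    """
    rng = rng or random.Random()
    state = {i: "S" for i in neighbors}
    state[source] = "I"
    for _ in range(t0):
        new_state = dict(state)  # all nodes are updated from the states at time t
        for i in neighbors:
            if state[i] == "I":
                # An infected node recovers with probability mu.
                if rng.random() < mu:
                    new_state[i] = "R"
            elif state[i] == "S":
                # Each infected neighbor transmits independently with
                # probability lam, i.e. infection probability 1 - prod(1 - lam).
                for k in neighbors[i]:
                    if state[k] == "I" and rng.random() < lam:
                        new_state[i] = "I"
                        break
        state = new_state
    return state
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when generating test snapshots.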

To define the problem of estimation of the epidemic 
origin we consider only the case where at initial time 
only one node is infected (this node will be referred to
as the patient zero, $i_0$), and all other nodes are initially
susceptible. After $t_0 > 0$ time steps ($t_0$ is in general
unknown), we observe the state of a set of nodes $\mathcal{O} \subset V$,
and the task is to estimate the location of patient zero 
based on this snapshot. 

Let us briefly explain two of the existing algorithms [H,
HI 0, that we will use as benchmarks. The authors
of g 1 considered only the case when all the nodes
were observed, $\mathcal{O} = V$. A version of these algorithms
for the more general case is suggested in appendix B.
The most basic measure for node $i$ to be the epidemic
origin is the distance centrality $D(i)$, which we define as
$D(i) = \sum_{j \in Q} d(i,j) (\delta_{q_j,I} + \delta_{q_j,R}/\mu_j)$, where the graph
$Q$ is a connected component of the original graph $G$ containing
all infected and recovered nodes and only them,
and $d(i,j)$ is the length of the shortest path between node $i$
and node $j$ on the graph $Q$. The ad-hoc factor $1/\mu_j$ is introduced
to distinguish recovered nodes, which for small $\mu_j$
tend to be closer to the epidemic origin. In the existing
works this factor was not present, because [1, Q treated 
only the SI model, and considered that susceptible 
and recovered nodes are indistinguishable. The authors 
of 0, H| suggested a "rumor centrality" estimator and 
showed that, for tree graphs, the rumor centrality and 
the distance centrality coincide. Another simple but well 
performing estimator, Jordan centrality J(i), was pro- 
posed in JjJ and corresponds to a node minimizing the 
maximum distance to other infected and recovered nodes: 
$J(i) = \max_{j \in Q} d(i,j)$. This estimator is known as the Jordan
center of $Q$ in the graph theory literature.
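
Both benchmark centralities can be computed with one breadth-first search per candidate node. The following is a minimal sketch, assuming a fully observed snapshot (so that the infected and recovered nodes form a connected component), a uniform recovery probability `mu`, and hypothetical helper names `bfs_distances` and `centralities`.

```python
from collections import deque

def bfs_distances(neighbors, allowed, start):
    """Shortest-path distances from start, moving only through `allowed` nodes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centralities(neighbors, snapshot, mu):
    """Return (D, J): distance and Jordan centrality of every I or R node.

    snapshot: dict node -> "S", "I" or "R". The I and R nodes are assumed to
    form one connected component; otherwise a distance lookup would fail.
    """
    q_nodes = {i for i, s in snapshot.items() if s in ("I", "R")}
    # Recovered nodes carry the ad-hoc weight 1/mu, infected nodes weight 1.
    weight = {j: 1.0 if snapshot[j] == "I" else 1.0 / mu for j in q_nodes}
    D, J = {}, {}
    for i in q_nodes:
        dist = bfs_distances(neighbors, q_nodes, i)
        D[i] = sum(dist[j] * weight[j] for j in q_nodes)
        J[i] = max(dist[j] for j in q_nodes)
    return D, J
```

In both cases the candidate with the smallest value is the estimated origin, e.g. `min(J, key=J.get)` for the Jordan center.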

The core of the algorithm proposed in the present work 
is dynamic message-passing, explained in the next section,
which is able to estimate, for a given patient zero and
a given observation time $t_0$, the probability that a
node $i$ will be observed in a given state. For simplicity of



the explanation let us first assume the time $t_0$ is known.
Let us call $P_S^j(t,i_0)$ (respectively $P_I^j(t,i_0)$, $P_R^j(t,i_0)$) the
probability that node $j$ was in state $S$ (resp. in state $I$,
$R$) at time $t$ provided the patient zero was node $i_0$.
With the use of Bayes' rule, the probability that node $i$
is the patient zero given the observed states is proportional
to the joint probability of observed states given
the patient zero, $P(i|\mathcal{O}) \propto P(\mathcal{O}|i)$. We can also define an
energy-like function of every node, $E(i) = -\log P(\mathcal{O}|i)$;
nodes with lower energy are then more likely to be the
infection source. If one were able to compute $P(\mathcal{O}|i)$ exactly
this would be an optimal inference scheme. However,
there is no tractable way to compute exactly the
joint probability of the observations, hence we approximate
it using a mean-field-type approach as a product
of the marginal probabilities provided by the dynamic
message-passing

$$P(\mathcal{O}|i) \simeq \prod_{k \in \mathcal{O},\, q_k(t_0)=S} P_S^k(t_0,i) \prod_{l \in \mathcal{O},\, q_l(t_0)=I} P_I^l(t_0,i) \prod_{m \in \mathcal{O},\, q_m(t_0)=R} P_R^m(t_0,i). \qquad (1)$$

Finally, to estimate the right value of time $t_0$ we need to
compute the energy $E(i)$ for different possible values of $t_0$
and choose the minimum one.
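
The mean-field energy of a candidate origin is a sum of negative logarithms of the marginals of the observed states. A minimal sketch follows; the data layout (`marginals` as nested dicts, `snapshot` restricted to observed nodes) and the function names `energy` and `rank_candidates` are assumptions made for illustration.

```python
import math

def energy(marginals, snapshot):
    """E = -log of the product of marginal probabilities of the observed states.

    marginals: dict node -> {"S": p_S, "I": p_I, "R": p_R} at time t0 for one
               candidate origin (e.g. produced by a DMP routine).
    snapshot:  dict node -> observed state, listing only the observed nodes.
    """
    total = 0.0
    for node, observed in snapshot.items():
        p = marginals[node][observed]
        if p <= 0.0:
            return math.inf  # this candidate cannot produce the snapshot
        total -= math.log(p)
    return total

def rank_candidates(energies):
    """Sort candidate origins by increasing energy (most likely first)."""
    return sorted(energies, key=energies.get)
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow when many nodes are observed.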

Dynamic message-passing algorithm: Let us explain
the dynamic message-passing equations for the SIR
model. The proof that these equations are exact on
trees and connections to some related existing results,
namely [13]-[16], [19]-[22], are discussed in appendix A. We
first define the message $P_S^{i \to j}(t)$ as the probability for
node $i$ to be in the state $S$ at time $t$ in the cavity graph
in which node $j$ has been removed. The quantity $\theta^{k \to i}(t)$
is the probability for node $k$ not to pass the infection signal
to node $i$ up to time $t$, and $\phi^{k \to i}(t)$ is the probability
for node $k$ to be in the state $I$ and not to pass the infection
to node $i$ up to time $t$. The initial conditions are
$\theta^{k \to i}(0) = 1$ and $\phi^{k \to i}(0) = \delta_{q_k(0),I}$. For more precise
definitions see appendix A. These messages satisfy the
recursion rules:

$$P_S^{i \to j}(t+1) = P_S^i(0) \prod_{k \in \partial i \setminus j} \theta^{k \to i}(t+1), \qquad (2)$$

$$\theta^{k \to i}(t+1) - \theta^{k \to i}(t) = -\lambda_{ki} \phi^{k \to i}(t), \qquad (3)$$

$$\phi^{k \to i}(t) = (1 - \lambda_{ki})(1 - \mu_k) \phi^{k \to i}(t-1) - [P_S^{k \to i}(t) - P_S^{k \to i}(t-1)]. \qquad (4)$$

Here $\partial i \setminus j$ means the set of nodes neighboring node $i$,
excluding $j$. The marginal probabilities that node $i$ is in
a given state at time $t$ are then given as

$$P_S^i(t+1) = P_S^i(0) \prod_{k \in \partial i} \theta^{k \to i}(t+1), \qquad (5)$$

$$P_R^i(t+1) = P_R^i(t) + \mu_i P_I^i(t), \qquad (6)$$

$$P_I^i(t+1) = 1 - P_S^i(t+1) - P_R^i(t+1). \qquad (7)$$

Hence the algorithmic complexity for computing the energy
$E(i)$ of a given vertex $i$ (and therefore the probability
that it is the epidemic origin) is $O(t_0 N c)$, where $c$ is
the average degree of the graph.
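
Equations (2)-(7) translate almost line by line into code. The sketch below is an illustration rather than the authors' implementation: the name `dmp_sir`, the adjacency-dict input, the single known source, and uniform `lam` and `mu` are all assumptions made here.

```python
from math import prod

def dmp_sir(neighbors, lam, mu, source, t_max):
    """Iterate Eqs. (2)-(7) for t_max steps from a single infected source.

    Returns a list whose entry t is a dict node -> {"S": .., "I": .., "R": ..}.
    Uniform lam and mu are an assumption made for brevity.
    """
    p_s0 = {i: 0.0 if i == source else 1.0 for i in neighbors}
    edges = [(k, i) for i in neighbors for k in neighbors[i]]  # message k -> i
    theta = {e: 1.0 for e in edges}                      # theta^{k->i}(0) = 1
    phi = {(k, i): 1.0 - p_s0[k] for (k, i) in edges}    # delta_{q_k(0), I}
    ps_cav = {(k, i): p_s0[k] for (k, i) in edges}       # P_S^{k->i}(0)
    marg = {i: {"S": p_s0[i], "I": 1.0 - p_s0[i], "R": 0.0} for i in neighbors}
    history = [marg]
    for _ in range(t_max):
        # Eq. (3): theta(t+1) = theta(t) - lam * phi(t)
        theta = {e: theta[e] - lam * phi[e] for e in edges}
        # Eq. (2): cavity probability that k is still S when i is excluded
        new_cav = {
            (k, i): p_s0[k] * prod(theta[(l, k)] for l in neighbors[k] if l != i)
            for (k, i) in edges
        }
        # Eq. (4): phi(t+1) from phi(t) and the decrease of the cavity P_S
        phi = {e: (1 - lam) * (1 - mu) * phi[e] - (new_cav[e] - ps_cav[e])
               for e in edges}
        ps_cav = new_cav
        # Eqs. (5)-(7): marginals at time t+1
        new_marg = {}
        for i in neighbors:
            p_s = p_s0[i] * prod(theta[(k, i)] for k in neighbors[i])
            p_r = marg[i]["R"] + mu * marg[i]["I"]
            new_marg[i] = {"S": p_s, "I": 1.0 - p_s - p_r, "R": p_r}
        marg = new_marg
        history.append(marg)
    return history
```

On a three-node path with the source at one end, the marginals returned by this sketch agree with a direct enumeration of the SIR dynamics, as expected for a tree. Each step touches every directed edge once, with a product over at most $c$ neighbors.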
Performance of inference algorithms: We first test
our algorithm on random regular graphs, i.e. random
graphs where every node has degree $c$. In all the examples
we consider uniform transmission and recovery
probabilities $\lambda_{ij} = \lambda$ and $\mu_i = \mu$.
FIG. 1. (color online) A test of inference of the epidemic origin
on random regular graphs of degree $c = 4$, size $N = 1000$.
Inset: An epidemic is generated with recovery probability $\mu = 1$,
transmission probability $\lambda = 0.6$; a snapshot of all the nodes
is taken at time $t_0 = 8$ (in this figure we assume we know the
value of the time $t_0$), and 242 nodes are observed to be in the I
or R state. The dynamic message-passing is used to compute
the energy of every node. This energy is finite for 43 nodes;
it is plotted as a function of their rank $r$. The true patient
zero is marked by a red cross, and its rank is $r(i_0) = 2$ in this
case. The main figure: An epidemic generated with $\mu = 1$,
$\lambda = 0.5$, $t_0 = 5$. Histogram (over 1000 random instances)
of the normalized rank (i.e. the rank divided by the number
of R or I nodes in the snapshot) of the true patient zero is
plotted for the dynamic message-passing inference, as well as
for the distance, rumor and Jordan centrality measures.

In the first example, inset of Fig. 1, we plot the energy
$E(i)$ resulting from the dynamic message-passing for
the nodes for which the probability of being the epidemic
origin is nonzero according to our algorithm; we order the
nodes according to the energy value. The true epidemic
origin is marked with a red cross. We define the rank
of a candidate for the epidemic origin to be its position
in this ranking (the lowest energy node having rank 0).
In the main part of Fig. 1 we plot the histogram of normalized
ranks (i.e. the rank divided by the total number
of nodes that were observed as recovered or infected) of 
the true epidemic origin as obtained from our DMP in- 
ference algorithm, compared to the rankings obtained by 
distance, rumor and Jordan centralities. We see that the 
DMP inference algorithm considerably outperforms the 
three centrality measures, with a comparable computa- 
tional cost. 
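
The rank and normalized rank used in these comparisons can be computed as follows. This is a small sketch with a hypothetical function name; ties in energy are broken by the stable sort order, which is an arbitrary choice made here.

```python
def normalized_rank(energies, true_origin, n_infected_or_recovered):
    """Return (rank, normalized rank) of the true origin.

    energies: dict candidate node -> energy E(i); rank 0 is the lowest energy.
    n_infected_or_recovered: number of I or R nodes in the snapshot.
    """
    ordering = sorted(energies, key=energies.get)
    rank = ordering.index(true_origin)
    return rank, rank / n_infected_or_recovered
```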

In Fig. 2 we present the average normalized rank of the
true epidemic origin for random regular graphs for the
whole range of the transmission probability $\lambda$, for different
values of the recovery probability $\mu$, and a snapshot
of all the nodes at time $t_0$. We see that DMP inference
outperforms the centrality measures, except in case (c) in
a range of $0.3 < \lambda < 0.58$ where the Jordan center is a better
estimator. In other cases, however, Jordan centrality
performs worse. Note that for $\mu < 1$ our implementation
of Jordan centrality does not distinguish between recovered
and infected nodes, which partly explains its very
bad performance in that case. We have also computed
the results of the rumor centrality measure; according to
our results it has never been systematically better than
distance centrality.

It is important to note that for some range of param- 
eters the average normalized rank of the true epidemic 
origin is not so close to zero (note that value 1/2 of the 
normalized rank corresponds to a random guess of patient 
zero among all the infected or recovered nodes). The 
problem of estimating the epidemic origin with a good 
precision is very hard in these regions. In some cases the
information about the epidemic origin was lost during the
spreading process: for instance for $\lambda > \lambda_c = \mu/(c-1)$ [23]
the epidemic percolates at large times $t_0 \gg \log_c N$. Then
the information about the epidemic origin is lost. On the
other hand for $t_0 < \log_c N$, Fig. 2 (b), the epidemic is
confined to a tree network and in this case the inference
of the origin is easier. In Fig. 2 we mostly focused on
the intermediate case $t_0 \sim \log_c N$. In our opinion the
systematic comparison presented here is a good test-bed 
for comparing and improving algorithms. 

So far we have tested only cases where the states of 
all the nodes are observed in the snapshot. In practical 
situations it is more likely that only a fraction of nodes 
is observed. We have tested our algorithm in the case
where a fraction $\xi$ of nodes is not observed. Fig. 3 (a)
shows the average rank of the true epidemic origin. We
chose parameters for which our algorithm compared the
worst to the Jordan and distance centralities, generalized
to the case of incomplete snapshots as described in
appendix B. The result of our test in Fig. 3 shows that
with incomplete snapshots the DMP inference algorithm
outperforms both centralities, even in the case where
for complete snapshots the Jordan centrality was better.
Such robustness is a very useful property.

In the example depicted in Fig. 3 (b), we took the
network of the U.S. East-Coast power grid which contains
$N = 4941$ nodes with a maximum degree 19 [24]. We see
that the DMP estimator gives a better prediction over the
whole range of $\lambda$.

Finally note that in our numerical tests with DMP we
assumed so far the knowledge of the observation time
$t_0$. To estimate the spreading time we use the following
procedure: given a snapshot of epidemic spread, first use
one of the centrality algorithms (e.g. distance centrality) in order to select
an estimate $\hat{i}_0$ of the epidemic origin. This estimate
does not need to be very precise. Then plot the values of
the energy $E(\hat{i}_0)$ as a function of time; the minimum of
this function maximizes the probability that $\hat{i}_0$ was
FIG. 2. (color online) Average rank of the true epidemic origin on random regular graphs of size $N = 1000$ with degree $c = 4$.
Each data point is averaged over 1000 instances. The snapshot time $t_0$ and recovery probability $\mu$ are, from the left: (a) $t_0 = 10$,
$\mu = 0.5$, (b) $t_0 = 5$, $\mu = 1$ and (c) $t_0 = 10$, $\mu = 1$. The DMP estimator is given by red circles, the Jordan centrality estimator by
green triangles, the distance centrality estimator by blue boxes. The dotted line shows the average fraction of nodes that were
infected or recovered in the snapshot, $|Q|/N$; we use this number to normalize the ranks of the epidemic origin.
FIG. 3. (color online) Left: (a) Data for a random contact
network of size $N = 1000$, degree $c = 4$. Recovery probability
$\mu = 1$, transmission probability $\lambda = 0.5$ and $\lambda = 0.7$; only
the state of a fraction $1 - \xi$ of nodes is observed at time
$t_0 = 10$. Rank (averaged over 1000 instances) of the true
epidemic origin obtained with our DMP inference algorithm is
compared to the distance and Jordan centralities. Right: (b)
Normalized rank (averaged over 1000 instances) of the true
epidemic origin for epidemic spreading with $\mu = 0.5$, and all
nodes observed at time $t_0 = 10$, on the network of the U.S.
East-Coast power grid. DMP inference is significantly better
than inference based on distance and Jordan centralities.



the epidemic origin and generally lies very close to the
true value of $t_0$.
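
The time-estimation step amounts to a one-dimensional minimization of the energy of the guessed origin. A minimal sketch, assuming a hypothetical callable `energy_at(t)` that runs the message-passing from the guessed origin for `t` steps and returns the corresponding energy:

```python
def estimate_observation_time(energy_at, t_candidates):
    """Return the candidate time t that minimizes energy_at(t).

    energy_at:    callable t -> energy of the guessed origin at time t
                  (hypothetical; it would wrap a DMP run and Eq. (1)).
    t_candidates: iterable of candidate observation times.
    """
    return min(t_candidates, key=energy_at)
```

Since the marginals at all times up to the largest candidate are produced by a single forward iteration, the scan over candidate times does not require restarting the iteration for each value.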

Our algorithm is based on an approximation to the 
Bayes optimal inference, and therefore it is not optimal. 
There are two possible sources of sub-optimality. The 
first is the fact that the message passing equations may 
lead to errors on loopy graphs. The second is the mean-field-like
approximation (1) of the joint probability distribution.
We have observed that taking into account the
two-point correlation in this approximation does not lead
to any improvement in our results. It would be interesting
to search for better approximations of the likelihood
on a general graph.

Conclusion: The solution of dynamics of the SIR 
model in terms of message-passing equations allowed us 
to develop an efficient algorithm for detection of the epi- 
demic origin. Compared to existing algorithms, it gener- 
ically (except for a narrow range of parameters) provides 
an improved estimate for the source of infectious out- 
break and its performance is robust in the case where 
one has access to the status of only a fraction of the 
nodes in the network. 
Appendix A: Dynamic message-passing for SIR model



We present here a proof that the probabilities of being susceptible/infected/recovered at a given time $t$ as provided
by the dynamic message-passing (DMP) equations (2)-(7) from the main part of this paper are exact for all initial
conditions and every realization of the transmission and recovery probabilities $\lambda_{ij}$ and $\mu_i$ if the graph of contacts is
a tree. Before giving the proof we start with a couple of remarks explaining the relation to existing works.



1. General remarks about DMP



It should be noted that equations equivalent to (2)-(7) were first derived in [13]. The authors of [13] treated a more
general SIR model where the transmission and recovery probability depends on the time when the node in question 
was infected. For this more general case an easily tractable form (i.e. the probabilities at time t give probabilities 
at time $t + 1$ via a set of simple closed equations) of the DMP is not known. The equations in [13] were instead
written in a convolutional form that is rather complicated for numerical resolution. The authors noticed that when 
the probabilities of recovery and transmission are constant then the equations simplify, but did not write a version 
of the equations that is applicable on a given graph for a given initial condition (actually they only wrote equations 
averaged over a set of initial conditions). Hence we find it useful to provide the derivation of the DMP on a single
graph in their simple iterative form. 

For the purpose of this paper we use the DMP on a single instance of the contact network for a given initial condition. 
However, if an ensemble of initial conditions is given as well as an ensemble of random graphs with a given probability 
distribution then one can write differential equations for the fraction of nodes that are susceptible/infected/recovered 
at a given time. These equations were first derived by [13] and appeared also in [14] and [15]. One should not confuse
these averaged DMP equations with the "naive" mean field equations that are often written for the SIR model under 
the assumption of perfect mixing, see e.g. Q. Whereas the naive mean field equations provide only a very crude 
approximation for the real probabilities, the equations of [13]-[15] are exact in the thermodynamic limit, $N \to \infty$, as
long as in the random graph ensemble a random node does not belong (with a high probability) to a loop of constant
(in N) length. 

It is interesting to realize that the present DMP equations are applicable also for contact networks that are changing
in time. The generalization is straightforward: one only needs to encode the dynamics of the network into time-changing
transmission probabilities $\lambda_{ij}(t)$ and use the equations (2)-(7). The SIR model on dynamically changing
networks has been already studied using the graph-averaged version of the DMP equations in [19], [20]. We anticipate
that the DMP equations on a single graph will also be useful for studies where specific experimental data about the
changing network, such as those of [21], can be used.

Further we want the reader to note the similarities and, more importantly, the differences between DMP and BP [17].
The common point for both DMP and BP is that they are both exact if the underlying network is a tree. The crucial
difference is that BP is derived from a stationary Boltzmann-like probability distribution and only the fixed point of
the BP equations has a physical meaning, whereas in the DMP equations (presented here for the SIR model) every
step of the iterations corresponds to the physical time in the underlying dynamical process. Note, however, that DMP
can be derived from a "dynamic" belief propagation where variables in the corresponding graphical model are the
whole trajectories of a given node, see e.g. [22]. Here we will present a more straightforward derivation.



2. Derivation of DMP for the SIR model 



Here we present the derivation of equations (2)-(7) for tree contact networks. We define $P_S^i(t)$, $P_I^i(t)$ and $P_R^i(t)$ as
marginal probabilities that $q_i(t) = S$, $q_i(t) = I$ and $q_i(t) = R$. These marginals sum to one and thus

$$P_I^i(t+1) = 1 - P_S^i(t+1) - P_R^i(t+1). \qquad (A1)$$

Since the recovery process from state I to state R is independent of neighbors, we have 

$$P_R^i(t+1) = P_R^i(t) + \mu_i P_I^i(t). \qquad (A2)$$

The epidemic process on a graph can be interpreted as the propagation of infection signals from infected to susceptible
nodes. The infection signal $d^{i \to j}(t)$ is defined as a random variable which is equal to one with probability
$\delta_{q_i(t-1),I} \lambda_{ij}$, and equal to zero otherwise. Consider an auxiliary dynamics $D_j$ where the node $j$ receives infection
signals, but ignores them and thus is fixed to the $S$ state at all times. Since the infection cannot propagate through
the site $j$ in this dynamic setting, different graph branches rooted at node $j$ become independent if the underlying
graph is a tree. This is the natural generalization of the cavity method used for deriving BP (see [18]) to dynamical
processes. Notice that the auxiliary dynamics $D_j$ is identical to the original dynamics $D$ for all times such that
$q_j(t) = S$. We also define an auxiliary dynamics $D_{ij}$ in which the state of a pair of neighboring sites $i$ and $j$ is always
$S$.

In order to close the system of message-passing equations, we write the remaining update rules in terms of three 
kinds of cavity messages, defined as follows: 

• $\theta^{k \to i}(t)$ is the probability that the infection signal has not been passed from the node $k$ to the node $i$ up to time
$t$ in the dynamics $D_i$:

$$\theta^{k \to i}(t) = \mathrm{Prob}^{D_i} \left( \sum_{t'=0}^{t} d^{k \to i}(t') = 0 \right); \qquad (A3)$$

• $\phi^{k \to i}(t)$ is the probability that the infection signal has not been passed from the node $k$ to the node $i$ up to time
$t$ in the dynamics $D_i$ and that $k$ is in the state $I$ at time $t$:

$$\phi^{k \to i}(t) = \mathrm{Prob}^{D_i} \left( \sum_{t'=0}^{t} d^{k \to i}(t') = 0,\; q_k(t) = I \right); \qquad (A4)$$

• $P_S^{k \to i}(t)$ is the probability in the dynamics $D_i$ that $k$ is in the state $S$ at time $t$:

$$P_S^{k \to i}(t) = \mathrm{Prob}^{D_i} \left( q_k(t) = S \right). \qquad (A5)$$

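In what follows, we prove that

$$P_S^{i \to j}(t+1) = P_S^i(0) \prod_{k \in \partial i \setminus j} \theta^{k \to i}(t+1), \qquad (A6)$$

where $\partial i \setminus j$ means the set of neighbors of $i$ excluding $j$. Indeed, by definition

$$P_S^{i \to j}(t+1) = \mathrm{Prob}^{D_j} \left( q_i(t+1) = S \right) = P_S^i(0)\, \mathrm{Prob}^{D_j} \left( \sum_{k \in \partial i \setminus j} \sum_{t'=0}^{t+1} d^{k \to i}(t') = 0 \right). \qquad (A7)$$

Since the auxiliary dynamics $D_{ij}$ coincides with dynamics $D_j$ as long as the node $i$ is in the $S$ state, we can write

$$P_S^{i \to j}(t+1) = P_S^i(0)\, \mathrm{Prob}^{D_{ij}} \left( \sum_{k \in \partial i \setminus j} \sum_{t'=0}^{t+1} d^{k \to i}(t') = 0 \right). \qquad (A8)$$

Since different branches of the graph containing nodes $k \in \partial i \setminus j$ are connected only through the node $i$, they are
independent of each other, hence

$$P_S^{i \to j}(t+1) = P_S^i(0) \prod_{k \in \partial i \setminus j} \mathrm{Prob}^{D_{ij}} \left( \sum_{t'=0}^{t+1} d^{k \to i}(t') = 0 \right). \qquad (A9)$$

Moreover, for the nodes $k \in \partial i \setminus j$ the dynamics $D_{ij}$ is equivalent to the dynamics $D_i$, so we can replace $D_{ij}$ by $D_i$ in
the last expression and hence, using the definition (A3), we obtain equation (A6).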

We complete the updating rules by writing the equations for $\theta^{k \to i}(t)$ and $\phi^{k \to i}(t)$. The only way $\theta^{k \to i}(t)$ can
decrease is by actually transmitting the infection signal from $k$ to $i$, and this happens with probability $\lambda_{ki}$ multiplied by
the probability that the site $k$ was infected, so we have

$$\theta^{k \to i}(t+1) - \theta^{k \to i}(t) = -\lambda_{ki} \phi^{k \to i}(t). \qquad (A10)$$

The change of $\phi^{k \to i}(t)$ at each time step comes from three different possibilities: either the node $k$ actually sends
the infection signal to $i$ (with probability $\lambda_{ki}$), or it recovers (with probability $\mu_k$), or it switches to $I$ at this time
step, being previously in the $S$ state (this happens with probability $P_S^{k \to i}(t-1) - P_S^{k \to i}(t)$):

$$\phi^{k \to i}(t) - \phi^{k \to i}(t-1) = -\lambda_{ki} \phi^{k \to i}(t-1) - \mu_k \phi^{k \to i}(t-1) + \lambda_{ki} \mu_k \phi^{k \to i}(t-1) + P_S^{k \to i}(t-1) - P_S^{k \to i}(t). \qquad (A11)$$
The third term on the right-hand side of the previous equation is a compensation term that avoids counting twice the
situation when the node $k$ transmits the infection and recovers at the same time step. This completes the update
rules for cavity messages. These equations can be iterated in time starting from initial conditions for cavity messages:

$$\theta^{k \to i}(0) = 1, \qquad (A12)$$
$$\phi^{k \to i}(0) = \delta_{q_k(0),I}. \qquad (A13)$$
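
As a consistency check, collecting the terms proportional to $\phi^{k \to i}(t-1)$ in (A11) recovers the compact recursion (4) of the main text. The short derivation below assumes that the susceptible-state terms in (A11) denote the cavity message $P_S^{k \to i}$ defined in (A5):

```latex
\begin{align*}
\phi^{k\to i}(t)
  &= \bigl[1 - \lambda_{ki} - \mu_k + \lambda_{ki}\mu_k\bigr]\,\phi^{k\to i}(t-1)
     + P_S^{k\to i}(t-1) - P_S^{k\to i}(t) \\
  &= (1-\lambda_{ki})(1-\mu_k)\,\phi^{k\to i}(t-1)
     - \bigl[P_S^{k\to i}(t) - P_S^{k\to i}(t-1)\bigr].
\end{align*}
```

The factor $(1-\lambda_{ki})(1-\mu_k)$ is the probability that an infected node $k$ neither transmits to $i$ nor recovers during one time step.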

The marginal probability in the original dynamics $D$ is obtained by including all the neighbors $k \in \partial i$ in eq. (A6):

$$P_S^i(t+1) = P_S^i(0) \prod_{k \in \partial i} \theta^{k \to i}(t+1). \qquad (A14)$$

The closed set of equations (A1), (A2), (A6), (A10), (A11), (A14), together with the initial conditions (A12), (A13), gives the exact
values of marginal probabilities $P_S^i(t)$, $P_I^i(t)$ and $P_R^i(t)$ on a tree graph.



Appendix B: The centrality algorithms for incomplete snapshots 

In the case where the state of all the nodes is known at time $t_0$, the centrality algorithms work on a connected
component $Q$ of infected and recovered nodes. In practice the information is available only for a fraction $1 - \xi$ of nodes
in the graph $G$. The snapshot $\mathcal{O}(t_0)$ can then be thought of as a configuration of $(1 - \xi)N$ nodes in the states $S$, $I$,
$R$ (nodes for which we have the information), and of $\xi N$ randomly located nodes in the unknown state $X$. Now the
infected and recovered nodes in general do not form a connected component and are located in several disconnected
components, separated by the nodes in the unknown state $X$. Nevertheless, it is clear that not all the $X$-nodes have
to be checked as possible candidates to be the actual source of infection. If a cluster of nodes in the $X$ state is
surrounded only by $S$-nodes, this cluster is clearly in the $S$ state itself. Other $X$-nodes can in principle
be the infection source and thus need to be checked.

We propose the following generalization of centrality algorithms for the $\xi \neq 0$ case. First we construct a connected
component composed of all the nodes in the $I$ and $R$ states and clusters of $X$ nodes which are not completely encircled
by $S$-nodes. This gives a connected component of $I$, $R$ and $X$ nodes attached together. Since now we have a connected
component $Q$, we can run centrality algorithms on it in the usual way. For $\xi = 0$ the connected component constructed
in this way coincides with the connected component composed of infected and recovered nodes.
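
The construction of the candidate set for incomplete snapshots can be sketched as follows. The function name `candidate_component`, the adjacency-dict input, and the encoding of unobserved nodes by the label "X" are assumptions made for this illustration.

```python
def candidate_component(neighbors, snapshot):
    """Nodes of Q: all I and R nodes plus X-clusters not fully enclosed by S nodes.

    snapshot: dict node -> "S", "I", "R" or "X" (state unknown).
    """
    keep = {i for i, s in snapshot.items() if s in ("I", "R")}
    seen = set()
    for start in neighbors:
        if snapshot[start] != "X" or start in seen:
            continue
        # Collect the cluster of connected X nodes containing `start`,
        # together with the states of the observed nodes bordering it.
        cluster, stack, boundary = set(), [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            cluster.add(u)
            for v in neighbors[u]:
                if snapshot[v] == "X":
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
                else:
                    boundary.add(snapshot[v])
        # Keep the cluster only if at least one bordering node is I or R.
        if boundary - {"S"}:
            keep |= cluster
    return keep
```

The returned node set can then be passed to a centrality routine in place of the set of infected and recovered nodes.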

Below (Fig. 4) we compare the distributions of ranks for DMP and Jordan estimators for different $\xi$ ($\xi = 0$,
$\xi = 0.5$ and $\xi = 0.9$) in the special case of 'deterministic' recovery $\mu = 1$ for $\lambda = 0.5$ and $\lambda = 0.7$. The results are
presented for a regular random graph composed of $N = 1000$ nodes with connectivity $c = 4$, and we take $t_0 = 10$.
The plot shows how often the rank of the actual epidemic origin $i_0$ is within the value of the corresponding bin (0%
means exact reconstruction). According to the histogram, in 60% of cases we manage to locate the true infection
source within 10% of relevant nodes (those situated in $Q$) for $\xi = 0$. This number falls to 40% for $\xi = 0.9$, when the
states of only 10% of nodes in the network are known.

We see that although for $\xi = 0$ the rank distribution based on the Jordan centrality estimator gives better results
(in the case $\lambda = 0.5$), it is no longer efficient when the number of unknown nodes gets larger (for all $\xi > 0.4$). The
dependence on $\xi$ for the case $\lambda = 0.7$ follows the same patterns.
FIG. 4. (color online) Distribution of inferred rank of the epidemic origin measured over the graph $Q$ for Jordan centrality
estimator (yellow) and DMP estimator (red) on regular random graphs: a) $\xi = 0$, $\lambda = 0.5$, b) $\xi = 0$, $\lambda = 0.7$, c) $\xi = 0.5$, $\lambda = 0.5$,
d) $\xi = 0.5$, $\lambda = 0.7$, e) $\xi = 0.9$, $\lambda = 0.5$, f) $\xi = 0.9$, $\lambda = 0.7$. The average is performed over 500 instances.



